Automatic visual speech recognition is an interesting problem in pattern recognition, especially when audio data is noisy or not readily available. It is also a very challenging task, mainly because the visual articulations carry less information than the audible utterance. In this work, principal component analysis is applied to image patches extracted from the video data to learn the weights of a two-stage convolutional network. Block histograms are then extracted as the unsupervised learning features. These features are used to train a recurrent neural network with a set of long short-term memory cells to obtain spatiotemporal features. Finally, the obtained features are used in a tandem GMM-HMM system for speech recognition. Our results show that the proposed method outperforms the baseline techniques applied to the OuluVS2 audiovisual database for frontal-view phrase recognition, with cross-validation and test sentence correctness reaching 79% and 73%, respectively, compared to the baseline of 74% on cross-validation.
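The abstract describes learning convolutional filters from image patches via PCA (a PCANet-style unsupervised stage). The following is a minimal sketch, not the authors' code, of how one such stage could be learned; the patch size, number of filters, and image shapes are illustrative assumptions.

```python
# Sketch: learn one stage of PCA-based convolutional filters from image
# patches (PCANet-style unsupervised feature learning). All parameter
# values here are hypothetical.
import numpy as np

def learn_pca_filters(images, patch_size=7, num_filters=8):
    """Return filters formed from the top principal components of
    mean-removed image patches."""
    patches = []
    for img in images:
        h, w = img.shape
        for i in range(h - patch_size + 1):
            for j in range(w - patch_size + 1):
                p = img[i:i + patch_size, j:j + patch_size].ravel()
                patches.append(p - p.mean())      # remove patch mean
    X = np.stack(patches, axis=1)                 # (patch_dim, num_patches)
    cov = X @ X.T / X.shape[1]                    # patch covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:num_filters]]
    return top.T.reshape(num_filters, patch_size, patch_size)

# Usage with random stand-in mouth-region frames
frames = [np.random.rand(32, 32) for _ in range(20)]
filters = learn_pca_filters(frames)
print(filters.shape)  # (8, 7, 7)
```

In a two-stage network such as the one described, the filter responses of the first stage would themselves be treated as input images for a second round of PCA filter learning, before block histograms are computed over the binarized responses.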